-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Add consensus-independent vat snapshot/transcript archiving configuration #10055
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a matter of process, could we split and merge the first part of this PR that addresses #10054? It seems like it can stand on its own and is sufficiently different than the archive change. I have reviewed that set of commit and would approve such a PR.
@@ -257,18 +257,27 @@ export function makeTranscriptStore( | |||
WHERE vatID = ? AND position >= ? AND position < ? | |||
`); | |||
|
|||
function doSpanRollover(vatID, isNewIncarnation) { | |||
function closeSpan(vatID, bounds) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a comment that closeSpan
does not note any changes to the .current
export and that the caller is in charge of that. An alternative would be to always note a deletion of that export when closing, but that's unnecessary churn for the common case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Deploying agoric-sdk with Cloudflare Pages
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good, but please consider some of the suggested refactorings.
packages/internal/src/node/utils.js
Outdated
* @template T | ||
* @param {() => Promise<T>} trier | ||
* @param {(error?: unknown) => Promise<unknown>} finalizer | ||
* @returns {ReturnType<trier>} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use ReturnType
here instead of Promise<T>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong reason, just a way to emphasize that the happy-path result is just decoration of the function.
packages/internal/src/node/utils.js
Outdated
* | ||
* @template T | ||
* @param {( | ||
* addCleanup: (fn: (err?: unknown) => Promise<void>) => void, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think addCleanup
should be allowed to accept a function that returns something else than a promise, but the problem is that Promise<void> | void
has undesirable behavior. Maybe Promise<any> | any
, just for documentation purposes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be easier to just wrap with async () => { … }
, but we can cross that bridge if/when we get there. This is an internal async helper, and there's no current need for supporting non-async callbacks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider extracting the asyncification in a standalone refactor
commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It actually started that way, but it turned out that only two of the changes were non-fundamental.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by "non-fundamental". I consider making a function and all its callers transitively async to be a pure refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that you want to see a commit which is just marking functions as async
and inserting await
at their call sites, with no point-in-time motivation for such a refactor until the following commit adds internal await
s, as opposed to a42c45f including both? I don't think I would find that useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be my soft preference, yes. It's similar to moving some code around in a first commit in preparation for a subsequent commit which has a more minimal behavioral change.
import { resolve as pathResolve } from 'path'; | ||
import path from 'node:path'; | ||
import v8 from 'node:v8'; | ||
import process from 'node:process'; | ||
import fs from 'node:fs'; | ||
import fsPromises from 'node:fs/promises'; | ||
import { performance } from 'perf_hooks'; | ||
import { resolve as importMetaResolve } from 'import-meta-resolve'; | ||
import tmpfs from 'tmp'; | ||
import { performance } from 'node:perf_hooks'; | ||
import { fork } from 'node:child_process'; | ||
import { resolve as importMetaResolve } from 'import-meta-resolve'; | ||
import tmp from 'tmp'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider extracting the rename of path
and tmp
in a standalone refactor
commit.
const store = initSwingStore(dbDir, { | ||
exportCallback, | ||
keepTranscripts, | ||
archiveTranscript, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that the swingStore does extra / different work when archiveTranscript
is present, should this be a variation of the test?
Same for archiveSnapshot
, especially if we end up teeing the compression stream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The work is purely additive, so I figured it was fine to always check in these tests rather than multiply its configuration to the cross product of keepTranscripts={true,false} and archiveTranscript={undefined,fn}—plus there's lots of testing that covers initSwingStore
with archiveTranscript=undefined, including in this file. Do you feel strongly about making this a test variation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strongly no, but since we already have a macro, I thought it could be cheap enough. Not all permutations need to be tested (e.g. transcript options are fully independent of the snapshot option)
name: tmpName, | ||
fd, | ||
removeCallback, | ||
} = tmp.fileSync({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use the sync version if we're in an async function ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong reason; this step doesn't need to be async and honestly wouldn't benefit from it anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just wondering because I believe it means the removeCallback is sync, which may be unexpected (according to types) for the cleanup. I do remember that tmp
had some surprising changes in behavior between sync and async, but I'm not sure what those where now.
c5f7c6c
to
4dcdb4f
Compare
deferred.push(kernelKeeper.removeVatFromSwingStoreExports(vatID)); | ||
for (const kpid of deadPromises) { | ||
resolveToError(kpid, makeError('vat terminated'), vatID); | ||
} | ||
} | ||
await Promise.all(deferred); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is necessary to preserve the log ordering in terminate.test.js. Alternatively (or additionally), we could update bootstrap-terminate-with-presence.js to tolerate ordering variation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused about why the simple await
was not working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Figuring it out was an interesting dive... vat-vat-admin.js terminateWithFailure
ends with a synchronous call to D(vatAdminDev).terminateWithFailure
, which through further synchronous calls in device-vat-admin.js → kernel/deviceManager.js → kernel.js → kernelSyscall.js → vat-admin-hooks.js
invokes void terminateVat(vatID, true, marshalledReason)
, which means that only the synchronous prelude of terminateVat
has an effect before the result promise of E(vatAdminNode).terminateWithFailure(reason)
settles and enqueues notifies.
Before this PR, kernelKeeper.removeVatFromSwingStoreExports
was synchronous and therefore could not impede the following step's rejection of promises yet to be decided by the terminating vat (which therefore preceded fulfillment of the terminateWithFailure
result promise).
But now that this PR is making removeVatFromSwingStoreExports
async, a naïve await
thereof would end the synchronous prelude before dead promise rejection, which is what caused the assertion in terminate.test.js to fail. Deferring the await
restores the expected ordering, although we may want to move it even further down to avoid bumping following logic out of the synchronous prelude, e.g.:
if (critical) {
// The following error construction is a bit awkward, but (1) it saves us
// from doing unmarshaling while in the kernel, while (2) it protects
// against the info not actually encoding an error, but (3) it still
// provides some diagnostic information back to the host, and (4) this
// should happen rarely enough that if you have to do a little bit of
// extra interpretive work on the receiving end to diagnose the problem,
// it's going to be a small cost compared to the trouble you're probably
// already in anyway if this happens.
panic(`critical vat ${vatID} failed`, Error(info.body));
} else if (vatAdminRootKref) {
// static vat termination can happen before vat admin vat exists
notifyTermination(
vatID,
vatAdminRootKref,
shouldReject,
info,
queueToKref,
);
} else {
console.log(
`warning: vat ${vatID} terminated without a vatAdmin to report to`,
);
}
// worker, if present, needs to be stopped (note that this only applies to ephemeral vats)
deferred.push(() => vatWarehouse.stopWorker(vatID));
await Promise.all(deferred);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually would want to go a step further and remove the async nature of this function, but still make it return a promise. However that would release zalgo somewhat, so we should not do that in a hurry now.
Please go ahead with the proposed change above, but use PromiseAllOrErrors
to avoid swallowing multiple errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #10073 to track addressing this footgun
…n rollover (#10060) Fixes #10054 ## Description The first commits from #10055, [by request](#10055 (review)), culminating in a fix of #10054 by introducing a `closeSpan` helper to support both span rollover and `stopUsingTranscript`. ### Security Considerations None known. ### Scaling Considerations If anything, a negligible reduction in transcriptStore size. ### Documentation Considerations None known. ### Testing Considerations Covered by unit tests. ### Upgrade Considerations This is kernel code that can be used at any node restart (i.e., because the configuration is consensus-independent, it doesn't even need to wait for a chain software upgrade).
4dcdb4f
to
e322e2b
Compare
…gateError Work around absence of AggregateError in xsnap.
4ff2fd7
to
bf76e00
Compare
Ref #10055 ## Description The internal `makeAggregateError` introduced by #10055 properly logged fake-AggregateError errors, but failed to expose them on the error object. But this is [already supported in endo](https://github.com/endojs/endo/blob/4406f5dde521d539a5effeb8ab68c1316e45261d/packages/ses/src/error/assert.js#L321), so we should use it. ### Security Considerations n/a ### Scaling Considerations n/a ### Documentation Considerations n/a ### Testing Considerations n/a ### Upgrade Considerations n/a
@@ -289,13 +392,13 @@ export function makeTranscriptStore( | |||
} | |||
} | |||
|
|||
function doSpanRollover(vatID, isNewIncarnation) { | |||
async function doSpanRollover(vatID, isNewIncarnation) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this makes me sad. Swingstore wants to be synchronous for all the "normal" day-to-day operations. I'm ok with async for commit
and exports and things done by the host, but I'd really prefer that everything the kernel touches is synchronous.
I see that it happened to fit into the existing callers (they're already awaiting a snapshot write, or a vat eviction), so the contagion didn't have a big impact, but in general, any async operational API call raises concerns about what state the swingstore is in while that Promise is pending.
In this case, there's a period where the old span is closed, but the new span has not yet been created (we're sitting in the await closeSpan()
, which is sitting in its await archiveTranscript
). If someone calls into swingstore while that Promise is pending, we'll have no isCurrent
spans, and doing something like addItem
will throw (at best) or be very confused (at worst). With a synchronous interface, this invariant-violating state is not exposed.
As you know, I'm not a fan of async file IO, I'm not convinced that there's enough perf improvement (if any) to justify the complexity. But I do see that you didn't have a lot of choice.. the gzip streams used by makeArchiveTranscript
would be a PITA to use synchronously (but that just makes me sad about the gzip APIs we have available).
So, ok, but let's make it a goal to keep the API synchronous, and only allow exceptions like this when really necessary.
(sorry I missed this when the review was pending, I was OOO).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries; after the fact reviews are still much appreciated! Should we open a new ticket about refactoring to isolate asynchronicity?
Fixes #10036
Description
Adds consensus-independent
vat-{snapshot,transcript}-archive-dir
cosmos-sdk swingset configuration for propagation in AG_COSMOS_INIT. When set, gzip-compressed vat snapshot/transcript files are written to the respective directory (the former when created, the latter when finalized by rollover).It also includes miscellaneous documentation and code improvements, most significantly a new helper for replacing
const cleanups = []; return aggregateTryFinally(async () => { … cleanups.push(…); … }, async () => { await PromiseAllOrErrors(cleanups.reverse().map(fn => Promise.resolve().then(() => fn()))); });
withreturn withDeferredCleanup(async addCleanup => { … addCleanup(…); … });
(i.e., folding in the cleanup registration and ordering).Best reviewed commit-by-commit.
Security Considerations
This provides a new interface by which the Swingset kernel can write to its host filesystem, albeit indirectly via callback. But as such, it is a potential vector for filling up a disk or possibly overwriting files, although subject to an operator's own bad configuration and/or elevated permissions.
Scaling Considerations
When one or both new settings are active and the disk is full or slow, performance may suffer. I considered performing these writes in the background, but decided that silent failures or slow writes would be worse than the new
await
s (and when active, the archive dir will likely be on the same filesystem as the database anyway).Documentation Considerations
I think I've covered what I need to, although there might be some other Markdown files worth updating.
Testing Considerations
New features are covered by unit tests.
Upgrade Considerations
This is all kernel code that can be used at any node restart (i.e., because the configuration is consensus-independent, it doesn't even need to wait for a chain software upgrade). But we should mention the new cosmos-sdk configuration in release notes, because it won't be added to existing app.toml files already in use.